First we need to load some libraries.
Next, we load the data from our CSV file. Because we want the average number of new cases and deaths for each month, the Date column must be split into three sub-columns (day, month and year) for our analysis.
Here are the first few rows of our data:
Date Name.of.State...UT Latitude Longitude Total.Confirmed.cases Death
1 2020-01-30 Kerala 10.8505 76.2711 1 0
2 2020-01-31 Kerala 10.8505 76.2711 1 0
3 2020-02-01 Kerala 10.8505 76.2711 2 0
4 2020-02-02 Kerala 10.8505 76.2711 3 0
5 2020-02-03 Kerala 10.8505 76.2711 3 0
6 2020-02-04 Kerala 10.8505 76.2711 3 0
Cured.Discharged.Migrated New.cases New.deaths New.recovered
1 0 0 0 0
2 0 0 0 0
3 0 1 0 0
4 0 1 0 0
5 0 0 0 0
6 0 0 0 0
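A minimal sketch of the loading step, assuming the file is read with base R's `read.csv()`. The real file name is not given here, so the example writes a tiny stand-in CSV to a temporary file; the raw header names in it are assumptions chosen to reproduce the column names printed above.

```r
# Hypothetical stand-in for the Kaggle CSV: a few rows written to a temp file
# so that the example runs on its own. The raw header names are assumptions.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Name of State / UT,Total Confirmed cases,Death",
  "2020-01-30,Kerala,1,0",
  "2020-01-31,Kerala,1,0",
  "2020-02-01,Kerala,2,0"
), csv_path)

# read.csv() turns headers into syntactic R names by default, replacing
# spaces and "/" with dots.
df <- read.csv(csv_path, stringsAsFactors = FALSE)

# Show the first rows, as in the output above.
head(df)
```

With the default `check.names = TRUE`, a header such as `Name of State / UT` becomes `Name.of.State...UT`, which would explain the dotted column names in the printed data.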
The summary of the data
Date Name.of.State...UT Latitude Longitude
Length:4692 Length:4692 Min. : 0.00 Min. : 0.00
Class :character Class :character 1st Qu.:18.11 1st Qu.:76.27
Mode :character Mode :character Median :23.94 Median :79.02
Mean :23.19 Mean :81.45
3rd Qu.:28.22 3rd Qu.:85.31
Max. :34.30 Max. :94.73
Total.Confirmed.cases Death Cured.Discharged.Migrated
Min. : 1 Length:4692 Min. : 0.0
1st Qu.: 39 Class :character 1st Qu.: 9.0
Median : 619 Mode :character Median : 197.5
Mean : 11394 Mean : 6908.1
3rd Qu.: 5233 3rd Qu.: 2736.0
Max. :468265 Max. :305521.0
New.cases New.deaths New.recovered
Min. : 0.0 Min. :0 Min. : -1.0
1st Qu.: 1.0 1st Qu.:0 1st Qu.: 0.0
Median : 26.0 Median :0 Median : 8.0
Mean : 418.6 Mean :0 Mean : 283.1
3rd Qu.: 210.2 3rd Qu.:0 3rd Qu.: 119.0
Max. :18366.0 Max. :0 Max. :13401.0
The structure of the data:
'data.frame': 4692 obs. of 10 variables:
$ Date : chr "2020-01-30" "2020-01-31" "2020-02-01" "2020-02-02" ...
$ Name.of.State...UT : chr "Kerala" "Kerala" "Kerala" "Kerala" ...
$ Latitude : num 10.9 10.9 10.9 10.9 10.9 ...
$ Longitude : num 76.3 76.3 76.3 76.3 76.3 ...
$ Total.Confirmed.cases : num 1 1 2 3 3 3 3 3 3 3 ...
$ Death : chr "0" "0" "0" "0" ...
$ Cured.Discharged.Migrated: num 0 0 0 0 0 0 0 0 0 0 ...
$ New.cases : int 0 0 1 1 0 0 0 0 0 0 ...
$ New.deaths : int 0 0 0 0 0 0 0 0 0 0 ...
$ New.recovered : int 0 0 0 0 0 0 0 0 0 0 ...
As we can see, the Death column is of type character, so we need to convert it to numeric before we can visualize the data.
'data.frame': 4692 obs. of 10 variables:
$ Date : chr "2020-01-30" "2020-01-31" "2020-02-01" "2020-02-02" ...
$ Name.of.State...UT : chr "Kerala" "Kerala" "Kerala" "Kerala" ...
$ Latitude : num 10.9 10.9 10.9 10.9 10.9 ...
$ Longitude : num 76.3 76.3 76.3 76.3 76.3 ...
$ Total.Confirmed.cases : num 1 1 2 3 3 3 3 3 3 3 ...
$ Death : num 0 0 0 0 0 0 0 0 0 0 ...
$ Cured.Discharged.Migrated: num 0 0 0 0 0 0 0 0 0 0 ...
$ New.cases : int 0 0 1 1 0 0 0 0 0 0 ...
$ New.deaths : int 0 0 0 0 0 0 0 0 0 0 ...
$ New.recovered : int 0 0 0 0 0 0 0 0 0 0 ...
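The type conversion described above can be sketched as follows. This is an illustration on a made-up three-row data frame, not the original code; the malformed entry `"0#"` is invented to show what happens to text that cannot be parsed as a number.

```r
# Toy data: Death stored as text, with one made-up malformed entry.
df <- data.frame(
  Name.of.State...UT = c("Kerala", "Kerala", "Kerala"),
  Death = c("0", "3", "0#"),
  stringsAsFactors = FALSE
)

# as.numeric() parses the strings; anything unparseable becomes NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
df$Death <- suppressWarnings(as.numeric(df$Death))

# Confirm that the column is now numeric.
str(df)
```

Values that fail to parse turn into `NA`, so it is worth checking for missing values straight after a conversion like this.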
Next, we look for missing values in the whole dataset. We will repeat this data cleaning step for every subset of the dataset that we take to answer a question.
Date Name.of.State...UT Latitude
0 0 0
Longitude Total.Confirmed.cases Death
0 0 1
Cured.Discharged.Migrated New.cases New.deaths
0 0 0
New.recovered
0
The Death column has a missing value in one row, which we fill with 0.
Here we create a new variable, df2, using the mutate function to convert the Date column into three separate columns (day, month and year). We do this so that we can average the Death and New.cases columns by month.
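The missing-value check and the fill can be sketched in base R as below. The data frame is a made-up example with a single `NA` in Death.

```r
# Toy data with one missing Death value.
df <- data.frame(
  Date = c("2020-03-01", "2020-03-02", "2020-03-03"),
  Death = c(0, NA, 2),
  New.cases = c(0L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
missing_before <- colSums(is.na(df))
print(missing_before)

# Replace the missing Death values with 0.
df$Death[is.na(df$Death)] <- 0

# Recount to confirm that nothing is missing any more.
missing_after <- colSums(is.na(df))
print(missing_after)
```

Filling with 0 treats the missing entry as "no deaths recorded", which is an analysis choice rather than a neutral default.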
Date Name.of.State...UT Latitude Longitude Total.Confirmed.cases Death
1 2020-01-30 Kerala 10.8505 76.2711 1 0
2 2020-01-31 Kerala 10.8505 76.2711 1 0
3 2020-02-01 Kerala 10.8505 76.2711 2 0
4 2020-02-02 Kerala 10.8505 76.2711 3 0
5 2020-02-03 Kerala 10.8505 76.2711 3 0
6 2020-02-04 Kerala 10.8505 76.2711 3 0
Cured.Discharged.Migrated New.cases New.deaths New.recovered day month year
1 0 0 0 0 2020 1 30
2 0 0 0 0 2020 1 31
3 0 1 0 0 2020 2 1
4 0 1 0 0 2020 2 2
5 0 0 0 0 2020 2 3
6 0 0 0 0 2020 2 4
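In the printed output above, the column labelled day holds 2020 and the column labelled year holds the day of the month, so the labels appear to be swapped. The month column, which the later grouping relies on, is unaffected. The sketch below is one way to do the split with `mutate` so that each part gets the matching name; it assumes dplyr and dates in `YYYY-MM-DD` form, and it is not the original code.

```r
library(dplyr)

# Toy data with dates in the same format as the dataset.
df <- data.frame(
  Date = c("2020-01-30", "2020-01-31", "2020-02-01"),
  New.cases = c(0L, 0L, 1L),
  stringsAsFactors = FALSE
)

df2 <- df %>%
  mutate(
    parsed = as.Date(Date, format = "%Y-%m-%d"),   # parse the text date
    year   = as.integer(format(parsed, "%Y")),     # four-digit year
    month  = as.integer(format(parsed, "%m")),     # month number, 1 to 12
    day    = as.integer(format(parsed, "%d"))      # day of the month
  ) %>%
  select(-parsed)                                  # drop the helper column

head(df2)
```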
Drop the Date column from df2
Name.of.State...UT Latitude Longitude Total.Confirmed.cases Death
1 Kerala 10.8505 76.2711 1 0
2 Kerala 10.8505 76.2711 1 0
3 Kerala 10.8505 76.2711 2 0
4 Kerala 10.8505 76.2711 3 0
5 Kerala 10.8505 76.2711 3 0
6 Kerala 10.8505 76.2711 3 0
Cured.Discharged.Migrated New.cases New.deaths New.recovered day month year
1 0 0 0 0 2020 1 30
2 0 0 0 0 2020 1 31
3 0 1 0 0 2020 2 1
4 0 1 0 0 2020 2 2
5 0 0 0 0 2020 2 3
6 0 0 0 0 2020 2 4
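Dropping the column can be done with dplyr's `select()` and a negative selection. A minimal sketch on made-up data:

```r
library(dplyr)

# Toy version of df2 with the Date column still present.
df2 <- data.frame(
  Date = c("2020-01-30", "2020-01-31"),
  New.cases = c(0L, 0L),
  month = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Keep every column except Date.
df2 <- df2 %>% select(-Date)

head(df2)
```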
Q1: The average number of new cases and deaths over each month.
Q2: The top 10 states by average number of new cases and deaths.
To see the number of new cases and deaths per month, we split the Date column into year, month and day, then group by month and take the mean of the observations.
# A tibble: 8 x 3
month New.cases Death
<dbl> <dbl> <dbl>
1 1 0 0
2 2 0.0690 0
3 3 2.63 0.408
4 4 33.8 14.0
5 5 139. 87.0
6 6 377. 300.
7 7 1110. 748.
8 8 1551. 1102.
| month | New.cases | Death |
|---|---|---|
| 1 | 0.0000000 | 0.0000000 |
| 2 | 0.0689655 | 0.0000000 |
| 3 | 2.6323232 | 0.4080808 |
| 4 | 33.8271078 | 13.9658485 |
| 5 | 139.1443798 | 87.0087209 |
| 6 | 377.3152709 | 299.9546798 |
| 7 | 1110.1172840 | 748.3755144 |
| 8 | 1550.7904762 | 1102.1047619 |
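The monthly averages above can be reproduced in outline with `group_by()` and `summarise()`. The sketch below assumes dplyr and uses a small made-up data frame in place of df2.

```r
library(dplyr)

# Toy stand-in for df2: two observations in each of two months.
df2 <- data.frame(
  month = c(3, 3, 4, 4),
  New.cases = c(2, 4, 30, 40),
  Death = c(0, 1, 10, 20)
)

# One row per month, holding the mean of each column.
monthly_means <- df2 %>%
  group_by(month) %>%
  summarise(
    New.cases = mean(New.cases),
    Death = mean(Death)
  )

print(monthly_means)
```

Each mean is taken over all state-day rows that fall in the month, so states with more rows in a month carry more weight.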
To see the mean new cases and deaths of the top 10 states, we group by state and take the mean of the observations.
# A tibble: 10 x 3
Name.of.State...UT New.cases Death
<chr> <dbl> <dbl>
1 Delhi 911. 1112.
2 Gujarat 490. 1013.
3 Karnataka 1030. 348.
4 Madhya Pradesh 265. 351.
5 Maharashtra 3185. 3998.
6 Tamil Nadu 1835. 750.
7 Telangana 1348. 344.
8 Telangana*** 0 455
9 Uttar Pradesh 687. 375.
10 West Bengal 655. 398.
| Name.of.State...UT | New.cases | Death |
|---|---|---|
| Delhi | 910.5909 | 1111.5390 |
| Gujarat | 490.1985 | 1013.1618 |
| Karnataka | 1030.2585 | 348.4422 |
| Madhya Pradesh | 264.6667 | 351.4148 |
| Maharashtra | 3185.4898 | 3997.6054 |
| Tamil Nadu | 1835.2953 | 750.1007 |
| Telangana | 1347.6471 | 343.8824 |
| Telangana*** | 0.0000 | 455.0000 |
| Uttar Pradesh | 686.7237 | 374.7303 |
| West Bengal | 654.6797 | 398.0703 |
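A sketch of the per-state step, assuming dplyr. The ranking column is an assumption: the table above includes Telangana*** with a mean of 0 new cases, which suggests the top 10 were picked by mean Death, so the example ranks on that. The data and the cut-off `n_top` are made up for illustration.

```r
library(dplyr)

# Toy data: three states with two observations each.
df2 <- data.frame(
  Name.of.State...UT = c("A", "A", "B", "B", "C", "C"),
  New.cases = c(10, 20, 100, 200, 1, 3),
  Death = c(5, 7, 50, 70, 0, 2),
  stringsAsFactors = FALSE
)

n_top <- 2  # the report uses 10; 2 keeps the toy output small

# Mean of each column per state.
state_means <- df2 %>%
  group_by(Name.of.State...UT) %>%
  summarise(
    New.cases = mean(New.cases),
    Death = mean(Death)
  )

# Keep the n_top states with the highest mean Death.
top_states <- state_means %>%
  arrange(desc(Death)) %>%
  head(n_top)

print(top_states)
```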
The dataset was downloaded from Kaggle. It is a Covid-19 dataset for the states of India, with variables covering those who tested positive for the virus (denoted as confirmed cases), deaths, the dates on which the virus was detected, the number of recoveries, geographic locations (denoted by longitude and latitude) and more. It provides a rich historical record to analyse and draw insight from.
The graph below summarises the number of Covid cases per state. The insight from the graph is that Covid-19 is widespread across India, and that case numbers do not differ greatly between states, except for two or three of them.
The graph below investigates the pattern in the distribution of Total Confirmed Cases as it relates to those who have survived and recovered from the virus. The relationship seems to suggest that as Covid cases rise, recoveries drop. A likely explanation is that new variants slow recoveries.
The graph below shows Covid cases per state and their outliers. Some states show regular, arithmetic increases, while others show geometric growth, as indicated by the number of outliers above their box plots. The sudden jump in numbers may be explained by population density.
The bar graph below shows the total cases in Indian states and, of those cases, how many resulted in death and how many were cured. As the graph shows, Maharashtra had over 10 million cases between 30-01-2020 and 06-08-2020, the highest overall of any state. Its active, cured and death counts are around 6.458M, 8.146M and 587.648k respectively, also higher than in any other state. Tamil Nadu is second highest for active, death and cured cases. The Union Territory of Chandigarh has only 2 active cases and 0 deaths and cured cases. The data also show 0 deaths and 0 cured cases in the Union Territory of Ladakh and the Union Territory of Jammu and Kashmir.
The graph below shows the death rate and recovery rate state by state, allowing confirmed cases, cured rate and death rate to be compared across states. Gujarat has the highest death rate of any state, at 5%. Telangana** has the highest cured rate, at 76.9%. Mizoram and Puducherry have a 0% death rate out of 13335 and 82967 total confirmed cases respectively. Meghalaya has one of the lowest cured rates, at 28.1%. The comparison clearly shows each state's rates of death, cured and active cases.
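How the rates were computed is not stated, so the following is only a sketch under an assumption: each rate is the state's summed deaths (or cured cases) divided by its summed confirmed cases, expressed as a percentage. It uses dplyr and made-up numbers.

```r
library(dplyr)

# Toy data: two states with two daily rows each.
df <- data.frame(
  Name.of.State...UT = c("A", "A", "B", "B"),
  Total.Confirmed.cases = c(100, 100, 50, 150),
  Death = c(5, 5, 0, 0),
  Cured.Discharged.Migrated = c(60, 80, 20, 30),
  stringsAsFactors = FALSE
)

# Assumption: rates are ratios of per-state sums, in percent.
state_rates <- df %>%
  group_by(Name.of.State...UT) %>%
  summarise(
    confirmed = sum(Total.Confirmed.cases),
    deaths = sum(Death),
    cured = sum(Cured.Discharged.Migrated)
  ) %>%
  mutate(
    death_rate = 100 * deaths / confirmed,
    cured_rate = 100 * cured / confirmed
  )

print(state_rates)
```

Because the count columns are cumulative, taking the latest row per state instead of the sum would be an equally plausible definition and can give different percentages.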
The scatter plot shows, date by date, the number of active, cured and death cases. According to the dataset, there were 1.378M total cases on 23-05-2020, and after that date cured cases surpass active cases. The total number of deaths as of 23-05-2020 is 67.317k. By 06-08-2020, total cured cases are around 32.413M out of all confirmed cases. The plot gives a simple date-wise picture of the growth in cured cases of coronavirus in India.
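The date-wise totals behind a plot like this could be built as sketched below. Two things are assumptions here: that active cases are confirmed cases minus deaths minus cured cases, and that the daily figures are sums over all states. The data are made up, and dplyr is assumed.

```r
library(dplyr)

# Toy data: two states reported on each of three dates.
df <- data.frame(
  Date = rep(c("2020-05-22", "2020-05-23", "2020-05-24"), each = 2),
  Total.Confirmed.cases = c(100, 50, 120, 60, 150, 70),
  Death = c(5, 2, 6, 2, 7, 3),
  Cured.Discharged.Migrated = c(40, 20, 55, 28, 80, 40),
  stringsAsFactors = FALSE
)

# Sum over states for each date, then derive active cases (assumed formula).
daily <- df %>%
  group_by(Date) %>%
  summarise(
    confirmed = sum(Total.Confirmed.cases),
    deaths = sum(Death),
    cured = sum(Cured.Discharged.Migrated)
  ) %>%
  mutate(active = confirmed - deaths - cured) %>%
  arrange(Date)

# First date on which cured cases exceed active cases.
first_crossover <- daily$Date[which(daily$cured > daily$active)[1]]

print(daily)
print(first_crossover)
```

Sorting the dates as text works here only because they are in `YYYY-MM-DD` form; other formats should be parsed with `as.Date()` first.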